Tool performance varies by context: preset views highlight the best-performing tool for a given language (Ruby, Python, TypeScript, Java, Go), PR size (small, medium, large), domain (UI, security, authentication, concurrency, caching, scheduling), bug type, risk level, code complexity, and headline metric (highest precision, recall, or F1). Filter and tailor the benchmark to fit your use case:
Available filters: Judge Model, Language, PR Size, Domain, Code Complexity, Review Difficulty, Risk Level, Context Required, and Primary Concern.
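To make the slicing concrete, here is a minimal, hypothetical sketch in Python; the per-PR record fields (tool, language, pr_size, tp, fp, fn) and their values are assumptions for illustration, not the benchmark's actual data model:

    from collections import defaultdict

    # Hypothetical per-PR result records; field names and values are illustrative.
    results = [
        {"tool": "reviewer-a", "language": "ruby", "pr_size": "medium", "tp": 7, "fp": 2, "fn": 3},
        {"tool": "reviewer-b", "language": "ruby", "pr_size": "medium", "tp": 6, "fp": 1, "fn": 5},
        {"tool": "reviewer-a", "language": "go", "pr_size": "large", "tp": 4, "fp": 5, "fn": 6},
    ]

    # Keep only the slice that matches your use case, e.g. medium-sized Ruby PRs.
    ruby_medium = [r for r in results
                   if r["language"] == "ruby" and r["pr_size"] == "medium"]

    # Aggregate true-positive, false-positive, and false-negative counts per tool
    # over the filtered slice, ready for metric computation.
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for r in ruby_medium:
        for key in ("tp", "fp", "fn"):
            counts[r["tool"]][key] += r[key]

Re-ranking within a filtered slice like this is why one tool can lead a given preset view and trail another.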
Current Results (All Languages)
The performance metrics table reports, for each tool, its Precision (%), Recall (%), F1 Score (%), True Positives, and the number of PRs Evaluated, accompanied by an F1 Score by Tool chart.
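The three percentages relate to the raw counts in the standard way; as a quick reference, here is the textbook computation sketched in Python (the variable names are illustrative):

    def precision_recall_f1(true_positives, false_positives, false_negatives):
        # Precision: the share of findings a reviewer reports that are real issues.
        precision = true_positives / (true_positives + false_positives)
        # Recall: the share of real issues the reviewer actually catches.
        recall = true_positives / (true_positives + false_negatives)
        # F1: the harmonic mean of precision and recall, rewarding balance.
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    # Example: 80 true positives, 20 false positives, 40 missed issues
    # gives precision 0.80, recall 0.67, and F1 0.73.

A precision-leaning tool flags fewer but surer issues; a recall-leaning tool catches more issues at the cost of noise, which is why the table reports all three alongside the raw counts.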
Repositories Used
The offline benchmark draws from a diverse set of open-source repositories spanning different languages, frameworks, and domains, from infrastructure and observability tools to web platforms and security projects. This variety ensures our results reflect how AI reviewers perform across real-world codebases, not just one type of software.